Inference of Multiscale Gaussian Graphical Model

Authors
Affiliations
Published

September 23, 2022

Abstract

Gaussian Graphical Models (GGMs) are widely used for exploratory data analysis in various fields such as genomics, ecology, psychometry. In a high-dimensional setting, when the number of variables exceeds the number of observations by several orders of magnitude, the estimation of GGM is a difficult and unstable optimization problem. Clustering of variables or variable selection is often performed prior to GGM estimation. We propose a new method allowing to simultaneously infer a hierarchical clustering structure and the graphs describing the structure of independence at each level of the hierarchy. This method is based on solving a convex optimization problem combining a graphical lasso penalty in its neigborhood selection version <> with a fused type lasso penalty. Results on real and synthetic data are presented.

build status Creative Commons License

Show the code
knitr::opts_chunk$set(tidy = FALSE, out.width="49%", out.height="20%",fig.show='hold',fig.align='center')
options(htmltools.dir.version = FALSE)

library(ggplot2)
library(simone)
library(SpiecEasi)
library(huge)
library(Matrix)
library(ghibli)

1 Introduction

Probabilistic graphical models (Lauritzen 1996; Koller and Friedman 2009) are widely used in high-dimensional data analysis to synthesize the interaction between variables. In many applications, such as genomics or image analysis, graphical models reduce the number of parameters by selecting the most relevant interactions between variables. Undirected Gaussian Graphical Models (GGMs) are a class of graphical models used in Gaussian settings. In the context of high-dimensional statistics, graphical models are generally assumed sparse, meaning that a small number of variables interact, compared to the total number of possible interactions. This assumption has been shown to provide both statistical and computational advantages by simplifying the structure of dependence between variables (Dempster 1972) and allowing efficient algorithms (Meinshausen and Bühlmann 2006). See, for instance, (Fan, Liao, and Liu 2016) for a review about sparse graphical models inference.

In GGMs, it is well known (see, e.g., (Lauritzen 1996)) that inferring the graphical model or equivalently the Conditional Independence Graph (CIG) boils down to inferring the support of the precision matrix \mathbf{\Omega} (the inverse of the variance-covariance matrix). To do so, several \ell_1 penalized methods have been proposed in the literature to learn the CIG of GGMs. For instance, the neighborhood selection (Meinshausen and Bühlmann 2006) (MB) based on a nodewise regression approach via the least absolute shrinkage and selection operator (Lasso, (R. Tibshirani 1996)) is a popular method. Each variable is regressed on the others, taking advantage of the link between the so-obtained regression coefficients and partial correlations. More precisely, for all 1 \le i \le p, the following problem is solved:

\hat{\boldsymbol{\beta}^i}(\lambda) = \underset{\boldsymbol{\beta}^i \in \mathbb{R}^{p-1}}{\operatorname{argmin}} \frac{1}{n} \left \lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right \rVert_2 ^2 + \lambda \left \lVert \boldsymbol{\beta}^i \right \rVert_1. \tag{1}

In Equation 1, \lambda is a non negative regularization parameter and \mathbf{X}^{\setminus i} denotes the matrix \mathbf{X} deprived of column i. The MB method defined by the estimation problem in Equation 1 has generated a long line of work in the field of nodewise regression methods. For instance, Rocha, Zhao, and Yu (2008), Ambroise, Chiquet, and Matias (2009) showed that nodewise regression could be seen as a pseudo-likelihood approximation, and Peng et al. (2009) extended the MB method to estimate sparse partial correlations using a single regression problem. Other inference methods similar to nodewise regression include a method based on Dantzig selector (Yuan 2010) and the introduction of the Clime estimator (Cai, Liu, and Luo 2011).

Another family of sparse CIG inference methods directly estimates \mathbf{\Omega} via direct minimization of the \ell_1-penalized negative log-likelihood (Banerjee, El Ghaoui, and d’Aspremont 2008), without resorting to the auxiliary regression problem. This method, called the graphical lasso (Friedman, Hastie, and Tibshirani 2007), benefits from many optimization algorithms (Yuan and Lin 2007; Rothman et al. 2008; Banerjee, El Ghaoui, and d’Aspremont 2008; Hsieh et al. 2014).

Such inference methods are widely used and enjoy many favorable theoretical and empirical properties, including robustness to high-dimensional problems. However, some limitations have been observed, particularly in the presence of strongly correlated variables. These limitations are caused by known impairments of Lasso-type regularization in this context (Bühlmann et al. 2012; Park, Hastie, and Tibshirani 2006). To overcome this, in addition to sparsity, several previous works attempt to estimate CIG by integrating clustering structures among variables for the sake of both statistical sanity and interpretability. A non-exhaustive list of works that integrate a clustering structure to speed up or improve the estimation procedure includes (Honorio et al. 2009; Ambroise, Chiquet, and Matias 2009; Mazumder and Hastie 2012; Tan, Witten, and Shojaie 2013; Yao and Allen 2019; Devijver and Gallopin 2018).

The above methods exploit the group structure to simplify the graph inference problem and infer the CIG between single variables. Another question that has received less attention is the inference of the CIG between the groups of variables, i.e., between the meta-variables representative of the group structure. A recent work introducing inference of graphical models on multiple grouping levels is (Cheng, Shan, and Kim 2017). They proposed inferring the CIG of gene data on two levels corresponding to genes and pathways, respectively. Note that pathways are groups of functionally related genes known in advance. The inference is achieved by optimizing a penalized maximum likelihood that estimates a sparse network at both gene and group levels. Our work is also part of this dynamic. We introduce a penalty term allowing parsimonious networks to be built at different hierarchical clustering levels. The main difference with the procedure of (Cheng, Shan, and Kim 2017) is that we do not require prior knowledge of the group structure, which makes the problem significantly more complex. In addition, our method has the advantage of proposing CIGs at more than two levels of granularity.

We introduce the Multiscale Graphical Lasso (MGLasso), a novel method to estimate simultaneously a hierarchical clustering structure, and graphical models depicting the conditional independence structure between clusters of variables at each level of the hierarchy. The procedure is based on a convex optimization problem with a hybrid penalty term combining a graphical Lasso and a group-fused Lasso penalty.

The approach is based on neighborhood selection (Meinshausen and Bühlmann 2006) and considers an additional fused-Lasso type penalty for clustering.

In the spirit of convex hierarchical clustering, introduced by Hocking et al. (2011), Lindsten, Ohlsson, and Ljung (2011), the hierarchy is obtained by spanning the entire regularization path. At each level of the hierarchy, variables in the same clusters are represented by a meta-variable; the new CIG is estimated between these new meta-variables, leading to a multiscale graphical model. Unlike Yao and Allen (2019), who introduced convex clustering in GGMs, our approach is expected to produce sparse estimations, thanks to an additional \ell_1 penalty.

The convex relaxation of hierarchical agglomerative clustering also known as sum of norms clustering can be formulated as follows. Given \boldsymbol X = \{ \boldsymbol x_1, \dots, \boldsymbol x_n\} \in \mathbb R^{n \times p}, Hocking et al. (2011) minimize with respect to the centroids matrix \boldsymbol \alpha \in \mathbb R^{n \times p} the criterion

\frac{1}{2} \sum_{i=1}^n \left \lVert \boldsymbol x_i - \boldsymbol \alpha_i \right \rVert_2^2 + \lambda \sum_{i < j} w_{ij} \left \lVert \boldsymbol \alpha_i - \boldsymbol \alpha_j \right \rVert_q \tag{2}

where \lambda is a sparsity penalization parameter, \{ w_{ij} \} are symmetric positive weights, \boldsymbol \alpha_i \in \mathbb R^p is the centroid to which observation \boldsymbol x_i is assigned to, and \left \lVert . \right \rVert_2 is the \ell_q-norm on \mathbb R^p with q \ge 1. The regularization path of solutions to problem in Equation 2 can lead to a tree structure i.e hierarchy under some conditions (Chi and Lange 2015; Chiquet, Gutierrez, and Rigaill 2017).

The use of fusion penalties in the inference of Gaussian graphical models is not new. Some approaches have considered the addition of a fusion penalty term to the graphical lasso global likelihood version, while infering graphical models accross multiple classes i.e. observations belonging to different known clusters in order to share characterics between the infered graphs (Danaher, Wang, and Witten 2014). Other approaches in the same vein as ours use fusion penalties (Yao and Allen 2019; Lin et al. 2020) to enforce local constancy among variables in the infered networks. Variables belonging to the same clusters are thus more likely to share the same neighborhood. To the best of our knowledge, the unpublised manuscript of Ganguly and Polonik (2014) is the first adaptation of a fusion-like penalty to the neighborhood selection framework. However the problem is solved in a columwise fashion where the p regressions problems are not combined. Note that fusion penalties have also been used simple regression problems like (Robert Tibshirani et al. 2005) in joint regressions problems like (see the works of Chen et al. (2010); Hallac, Leskovec, and Boyd (2015); Degras (2021); chu2021adaptive among others). However an explicit link have not been made with the Meinshausen and Bühlmann (2006) approach for undirected graphical model inference.

So, the key novelty of the MGLasso is the adaptation the global likelihood maximization problem to the neighborhood selection framework (Meinshausen and Bühlmann 2006) via the maximization of a pseudo-likelihood function. Moreover our fusion penalty differs from the ones recensed in the literature. We enforce fusion at the scale of the whole regression vector while permuting symmetric regression coefficients. Namely we use the general fusion term \sum_{i < j} \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2 where \tau_{ij} is a permutation function on symmetric coefficients.

The remainder of this paper is organized as follows. In section Multiscale Graphical Lasso, we formally introduce the Multiscale Graphical Lasso based on a convex estimation problem and an optimization algorithm based on the continuation of Nesterov’s smoothing technique (Hadj-Selem et al. 2018). Section Simulation experiments presents numerical results on simulated and real data.

2 Multiscale Graphical Lasso

The proposed method aims at inferring a graphical Gaussian model while hierarchically grouping variables. It infers conditional independence between different groups of variables. The approach is based on neighborhood selection (Meinshausen and Bühlmann 2006) and considers an additional fused-Lasso type penalty for clustering. In the spirit of hierarchical convex clustering, the hierarchical structure is recovered by spanning the regularization path.

Let X = (X^1, \dots, X^p)^T \in \mathbb R^p be a p-dimensional Gaussian random vector, with mean vector \mu and covariance matrix \mathbf \Sigma. The conditional independence structure of X is characterized by a graph G = (V, E), where V = \{1,\ldots p\} is the set of variables and E the set of edges, uniquely determined by the support of the precision matrix \mathbf{\Omega} = \mathbf{S}^{-1} (see, e.g., (Dempster 1972)). In other words, for any two vertices i,j\in V, the edge (i,j) belongs to the set E if and only if \Omega_{ij} \neq 0, that is if and only if the i-th and j-th variables are conditionally independent given all the others i.e. X^i \perp \!\!\! \perp X^j |X^{\setminus (i, j)} where X^{\setminus (i, j)} is the set of all p variables deprived of variables i and j.

Considering the linear model X^i = \sum_{j\neq i} \beta^i_j X_j + \epsilon_i where \epsilon_i is a Gaussian centered random variable, we have \beta^i_j = -\frac{\Omega_{ij}}{\Omega_{ii}}. We define the regression matrix \boldsymbol{\beta} := [{\boldsymbol{\beta}^1}^T, \ldots, {\boldsymbol{\beta}^p}^T]^T \in \mathbb{R}^{p \times (p-1)}, whose rows are the regression vectors for each of the p regressions.

Let the n \times p-dimensional matrix \mathbf{X} contain n independent observations of X. We propose to minimize the following criterion which combines Lasso and group-fused Lasso penalties:

J_{\lambda_1, \lambda_2}(\boldsymbol{\beta}; \mathbf{X} ) = \frac{1}{2} \sum_{i=1}^p \left \lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right \rVert_2 ^2 + \lambda_1 \sum_{i = 1}^p \left \lVert \boldsymbol{\beta}^i \right \rVert_1 + \lambda_2 \sum_{i < j} \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2, \tag{3}

where \tau_{ij} is a permutation exchanging the coefficients \boldsymbol{\beta}^j_j and \boldsymbol{\beta}^j_i and leaves other coefficients untouched, \mathbf{X}^{i}\in \mathbb{R}^n denotes the i-th column of \mathbf{X}, \boldsymbol{\beta}_{i} denotes the i-th row of \beta, \lambda_1 and \lambda_2 are penalization parameters. Let us consider

\hat{\boldsymbol{\beta}} \in \underset{\boldsymbol{\beta}}{\operatorname{argmin}} J_{\lambda_1, \lambda_2}(\boldsymbol{\beta}, \mathbf{X}).

The lasso penalty term encourages sparsity and the penalty term \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2 encourages to fuse regression vectors \boldsymbol{\beta}^i and \boldsymbol{\beta}^j. These fusions uncover a clustering structure. The model is likely to cluster together variables that have the same conditional effects on the others. Variables X^i and X^j are assigned to the same cluster when \boldsymbol{\beta}^i = \tau_{ij}(\boldsymbol{\beta}^j).

Let us illustrate by an example the effect of the proposed approach. If we consider a group of q variables whose regression vectors have at least q non-zero coefficients and further assume that for each pair of group variables i and j, \|\boldsymbol{\beta}^i - \tau_{ij} (\boldsymbol{\beta}^j)\|_2=0. After some permutations, we get a q\times q block of non-zeros coefficient \beta_{ij} corresponding to the group in the \boldsymbol{\beta} matrix, where (i,j)\in \{1,\cdots, q\}^2. If we consider three different indices i,j,k \ \in \{1,\cdots, q\}^3, it is straightforward to show that the six coefficients indexed by (i,j,k) are all equal. Thus the distance constraints between vectors \boldsymbol{\beta}^i of a group forces equality of all regression coefficients in the group.

Let us illustrate by an example the effect of the proposed approach. Two variables i and j are in the same group when \|\boldsymbol{\beta}^i - \tau_{ij} (\boldsymbol{\beta}^j)\|_2=0. Considering a cluster \mathcal C of q variables, it is straightforward to show that \forall (i,j) \in \mathcal C^2, we have \beta_{ij}=\beta_{\mathcal C}. Thus the algorithm tend to produce precision matrices with blocks of constant entries for a given value of \lambda_2. Each block corresponds to a cluster.

Let us illustrate by an example the effect of the proposed approach. If we consider a group of q variables whose regression vectors have at least q non-zero coefficients and further assume that for each pair of group variables i and j, \|\boldsymbol{\beta}^i - \tau_{ij} (\boldsymbol{\beta}^j)\|_2=0. After some permutations, we get a q\times q block of non-zeros coefficient \beta_{ij} corresponding to the group in the \boldsymbol{\beta} matrix, where (i,j)\in \{1,\cdots, q\}^2. If we consider three different indices i,j,k \ \in \{1,\cdots, q\}^3, it is straightforward to show that the six coefficients indexed by (i,j,k) are all equal. Thus the distance constraints between vectors \boldsymbol{\beta}^i of a group forces equality of all regression coefficients in the group.

The greater the regularization weight \lambda_2, the larger groups become. This is the core principle of the convex relaxation of hierarchical clustering introduced by Hocking et al. (2011). Hence, we can derive a hierarchical clustering structure by spanning the regularization path obtained by varying \lambda_2 while \lambda_1 is fixed. The addition of a fused-type term in graphical models inference has been studied previously by authors such as Honorio et al. (2009), Ganguly and Polonik (2014), Grechkin et al. (2015). However, these existing methods require prior knowledge of the neighborhood of each variable. On the contrary, our approach allows simultaneous inference of a multi-level graphical model and a hierarchical clustering of the variables.

In practice, if some information about the clustering structure is available, the problem can be generalized into:

\min_{\boldsymbol{\beta}} \sum_{i=1}^p\frac{1}{2} \left \lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right \rVert_2 ^2 + \lambda_1 \sum_{i = 1}^p \left \lVert \boldsymbol{\beta}^i \right \rVert_1 + \lambda_2 \sum_{i < j} w_{ij} \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2, \tag{4}

where w_{ij} is a positive weight encoding prior knowledge of the groups to which variables i and j belong to. A good choice for these pairwise affinities can help save time in computation and improve clustering performances. The choice can be guided by the application however improper specification can give rise to difficulty interpretable clustering structures (Chakraborty and Xu 2020). In the remainder of the paper, we will consider assume that w_{ij} = 1 for simplicity.

3 Numerical scheme

This section introduces a complete numerical scheme to apply MGLasso in practice, using a convex optimization algorithm and a model selection procedure. Section Optimization via CONESTA algorithm reviews the principles of the Continuation with Nesterov smoothing in a shrinkage-thresholding algorithm (CONESTA, (Hadj-Selem et al. 2018)), the optimization algorithm used in practice to solve MGLasso. Section Reformulation of MGLasso for CONESTA algorithm details a reformulation of MGLasso, which enables us to apply CONESTA. Finally, Section Model selection presents the procedure used to select the regularization parameters.

3.1 Optimization via CONESTA algorithm

The optimization problem for Multiscale Graphical Lasso is convex but not straightforward to solve using classical algorithms because of the fused-lasso type penalty, which is non-separable and admits no closed-form solution for the proximal gradient. We rely on the Continuation with Nesterov smoothing in a shrinkage-thresholding algorithm (CONESTA, Hadj-Selem et al. (2018)), dedicated to high-dimensional regression problems with structured sparsity such as group structures.

The CONESTA solver, initially introduced for neuro-imaging problems, addresses a general class of convex optimization problems which includes group-wise penalties. admitting loss functions of the form: The algorithm solves problems in the form f(\boldsymbol{\theta} ) = g(\boldsymbol{\theta}) + \lambda_1 h(\boldsymbol{\theta}) + \lambda_2 s(\boldsymbol{\theta}),

\operatorname{minimize} \quad f(\boldsymbol{\theta}) = g(\boldsymbol{\theta}) + \lambda_1 h(\boldsymbol{\theta}) + \lambda_2 s(\boldsymbol{\theta}), \tag{5}

with parameters vector \boldsymbol{\theta}\in \mathbb{R}^d where \lambda_1 and \lambda_2 are penalty parameters.

where \lambda_1 and \lambda_2 are penalty weights, and \boldsymbol{\theta}\in \mathbb{R}^d is a d-dimensional vector of parameters to optimize.

In the original paper (Hadj-Selem et al. 2018), the function g(\boldsymbol{\theta}) is the sum of a least squares criterion and a ridge penalty a differentiable function, h(\boldsymbol{\theta}) is a penalty function whose proximal operator \operatorname{prox}_{\lambda_1 h} is known in closed-form.

Given \phi \subseteq \{1,\ldots, d\}, let \boldsymbol{\theta}_\phi = (\theta_i)_{i \in \phi} denote the subvector of \boldsymbol{\theta} referenced by the indices in \phi. Denote \Phi = \{ \phi_1, \dots, \phi_{\operatorname{Card}(\Phi)}\} a collection with \phi_i \subseteq \{1,\ldots, d\}. Let the matrix \mathbf{A}_\phi \in \mathbb{R}^{m \times \operatorname{Card}(\Phi) } define a linear map from \mathbb{R}^{\operatorname{Card}(\phi)} to \mathbb{R}^m by sending the column vector \boldsymbol{\theta}_\phi \in \mathbb{R}^{\operatorname{Card}(\phi)} to the column vector \mathbf{A}_\phi \boldsymbol{\theta}_\phi \in \mathbb{R}^m. The function s(\boldsymbol{\theta}) is assumed to be an \ell_{1,2}-norm i.e. the sum of the group-wise \ell_2-norms of the elements \mathbf{A}_\phi \boldsymbol{\theta}_\phi, \phi \in \Phi. Namely,

and s(\boldsymbol{\theta}) is an \ell_{1,2} penalty of the form

s(\boldsymbol{\theta}) = \sum_{\phi \in \Phi} \|\mathbf{A}_\phi \boldsymbol{\theta}_\phi\|_2.

In the definition of s(\boldsymbol{\theta}), \Phi = \{ \phi_1, \dots, \phi_{\operatorname{Card}(\Phi)}\} is a set of subsets of indices, i.e., \Phi_i\subset \{1,\ldots, d\} for all i\in\{1,\ldots,\operatorname{Card}(\Phi)\} and, for all \phi\in \Phi, \boldsymbol{\theta}_\phi is the sub-vector of \boldsymbol{\theta} defined by \boldsymbol{\theta}_\phi = (\theta_i)_{i\in\phi}. Finally, \mathbf{D}_\phi are linear operators. <>

When \mathbf{A}_\phi is the identity operator such as \mathbf{A}_\phi \boldsymbol{\theta}_\phi = \boldsymbol{\theta}_\phi, the penalty function s is the overlapping group-lasso and m = \operatorname{Card}(\phi). When it is a discrete derivative operator, therefore s is a total variation penalty and m can be seen as the number of neighborhood relationships.

The main ingredient of CONESTA is the approximation of The non-smooth \ell_{1,2}-norm penalty with unknown proximal gradient , can be approximated by a smooth function with known proximal gradient computed using Nesterov’s smoothing (Nesterov 2005). Given a smoothness parameter \mu>0, let us define the smooth approximation

s_{\mu}(\boldsymbol{\theta}) = \max_{\boldsymbol{\alpha} \in \mathcal{K}} \left \{ \boldsymbol{\alpha}^T \mathbf{A} \boldsymbol{\theta} - \frac{\mu}{2} \| \boldsymbol{\alpha} \|_2^2 \right \}, where \mathcal{K} is the cartesian product of \ell_2-unit balls, \mathbf{A} is the vertical concatenation

of the matrices \mathbf{A}_\phi and \boldsymbol{\alpha} is an auxiliary variable resulting from the dual reformulation of s(\boldsymbol{\theta}) . Note that \lim_{\mu \rightarrow 0} s_{\mu}(\boldsymbol{\theta}) = s(\boldsymbol{\theta}). A Fast Iterative Shrinkage-Thresholding Algorithm (FISTA, (Beck and Teboulle 2009) step can then be applied after computing the gradient of the smooth part i.e. g(\boldsymbol{\theta}) + \lambda_2 s_{\mu}(\boldsymbol{\theta}) of the approximated criterion.

The main ingredient of CONESTA remains in the the determination of the optimal smoothness parameter using the duality gap, which minimizes the number of FISTA iterations for a given precision \epsilon. The specification of \mu is subject to dynamic update. A sequence of decreasing optimal smoothness parameters is generated in order to dynamically adapt the FISTA algorithm stepsize towards \epsilon. Namely, \mu^k = \mu_{opt}(\epsilon^k). The smoothness parameter decreases as one get closer to \boldsymbol{\theta} ^\star, the solution of the problem defined in Equation 5. Since \boldsymbol{\theta} ^\star is unknown, the approximation of the distance to the minimum is achieved via the duality gap. Indeed

\operatorname{GAP}_{\mu^k}(\boldsymbol{\theta}^k) \ge f_{\mu^k}(\boldsymbol{\theta}^k) - f(\boldsymbol{\theta}^\star) \ge 0.

We refer the reader to the seminal paper for more details on the formulation of \operatorname{GAP}_{\mu^k}(\boldsymbol{\theta}^k). The CONESTA routine is spelled out in the algorithm CONESTA solver where L(g + \lambda_2 s_{\mu}) is the Lipschitz constant of \nabla(g + \lambda_2 s_{\mu}). At each iteration of the CONESTA algoithm, the smoothing parameter \mu is updated dynamically using the duality gap, and a new approximation is computed. The CONESTA algorithm enjoys a linear convergence rate, and was shown empirically to outperform other computational options for structured-sparsity problems such as ADMM and inexact FISTA in terms of convergence speed (Hadj-Selem et al. 2018).

\begin{algorithm}
\caption{CONESTA solver}
\begin{algorithmic}
  \STATE \textbf{Inputs}: \\
    $\quad$ functions $g(\boldsymbol{\theta}), h(\boldsymbol{\theta}), s(\boldsymbol{\theta})$ \\
    $\quad$ precision $\epsilon$ \\
    $\quad$ penalty parameters $\lambda_1, \lambda_2$ \\
    $\quad$ decreasing factor $\tau \in (0,1)$ for sequence of precisions
    
  \STATE \textbf{Output:} \\
    $\quad$ $\boldsymbol{\theta}^{i+1} \in \mathbb{R}^d$

  \STATE \textbf{Initializations:} \\
    $\quad \boldsymbol{\theta}^0 \in \mathbb{R}^d$ \\
    $\quad \epsilon^0 = \tau \operatorname{GAP}_{\mu = 10^{-8}}(\boldsymbol{\theta}^0)$ \\
    $\quad \mu^0 = \mu_{opt}(\epsilon^0)$

  \Repeat
    \STATE $\epsilon^i_{\mu} = \epsilon^i - \mu^i \lambda_2 \frac{d}{2}$ \\
    \COMMENT{FISTA}
    \STATE $k=2$ \COMMENT{new iterator}
    \STATE $\boldsymbol{\theta}_{\operatorname{FISTA}}^1 = \boldsymbol{\theta}_{\operatorname{FISTA}}^0 = \boldsymbol{\theta}^i$ \COMMENT{Initial parameters value}
    \STATE $t_{\mu} = \frac{1}{L(g + \lambda_2 s_{\mu})}$ \COMMENT{Compute stepsize}
    
    \Repeat
      \STATE $\boldsymbol{z} = \boldsymbol{\theta}_{\operatorname{FISTA}}^{k-1} + \frac{k-2}{k+1}(\boldsymbol{\theta}_{\operatorname{FISTA}}^{k-1} - \boldsymbol{\theta}_{\operatorname{FISTA}}^{k-2})$
      \STATE $\boldsymbol{\theta}_{\operatorname{FISTA}}^k = \operatorname{prox}_{\lambda_1 h}(\boldsymbol{z} - t_{\mu} \nabla(g + \lambda_2 s_{\mu})(\boldsymbol{z}))$
    \Until{$\operatorname{GAP}_{\mu}(\boldsymbol{\theta}_{\operatorname{FISTA}}^k) \le \epsilon_{\mu}^i$} 
    
  \STATE $\boldsymbol{\theta}^{i+1} = \boldsymbol{\theta}_{\operatorname{FISTA}}^k$ \\
  \STATE $\epsilon^i = \operatorname{GAP}_{\mu = \mu_i} \boldsymbol{\theta}^{i+1} + \mu^i \lambda_2 \frac{d}{2}$ \\
  \STATE $\epsilon^{i+1} = \tau \epsilon^{i}$ \\
  \STATE $\mu^{i+1} = \mu_{opt}(\epsilon^{i+1})$
  \Until{$\epsilon^i \le \epsilon$}
  
\end{algorithmic}
\end{algorithm}

3.2 Reformulation of MGLasso for CONESTA algorithm

To apply CONESTA, it is necessary to reformulate the MGLasso problem in order to comply with the form of loss function required by CONESTA. The objective of MGLasso can indeed be written as

\operatorname{argmin} \frac{1}{2} ||\mathbf{Y} - \tilde{\mathbf{X}} \tilde{\boldsymbol{\beta}}||_2^2 + \lambda_1 ||\tilde{\boldsymbol{\beta}}||_1 + \lambda_2 \sum_{i<j} ||\boldsymbol D_{ij} \tilde{\boldsymbol{\beta}}||_2, \tag{6}

where \mathbf{Y} = \operatorname{Vec}(\mathbf{X}) \in \mathbb{R}^{np}, \tilde{\boldsymbol{\beta}} = \operatorname{Vec(\boldsymbol{\beta})} \in \mathbb{R}^{p (p-1)}, \tilde{\mathbf{X}} is a \mathbb{R}^{[np]\times [p \times (p-1)]} block-diagonal matrix with \mathbf{X}^{\setminus i} on the i-th block. The matrix \boldsymbol A_{ij} is a (p-1)\times p(p-1) matrix defined by chosen so that \boldsymbol D_{ij} \tilde{\boldsymbol{\beta}} = \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j).

\boldsymbol D_{ij} (k, l) = \begin{cases} 1, \ \text{if} \ l =(i-1)p+k, \\ -1, \ \text{if} \ l = (j-1)p+k,\\ 0, \ \text{otherwise.} \end{cases}

Note that we introduce this notation for simplicity of exposition, but, in practice, the sparsity of the matrices \boldsymbol D_{ij} allows a more efficient implementation. Based on reformulation Equation 6, we may apply CONESTA to solve the objective of MGLasso for fixed \lambda_1 and \lambda_2. The procedure is applied, for fixed \lambda_1, to a range of decreasing values of \lambda_2 to obtain a hierarchical clustering. The corresponding pseudo-code is given in the following algorithm where (\mathbf{X}^i)^{\dagger} denotes the pseudo-inverse of \mathbf{X}^i and \epsilon_{fuse} the threshold for merging clusters.

\begin{algorithm}
\caption{MGLasso algorithm}
\begin{algorithmic}
  \STATE \textbf{Inputs}: \\
    $\quad$ Set of variables $\mathbf{X} = \{\mathbf{X}^1, \dots, \mathbf{X}^p \} \in \mathbb R^{n\times p}$ \\
    $\quad$ Penalty parameters $\lambda_1 \ge 0, {\lambda_2}_{\operatorname{initial}} > 0$ \\
    $\quad$ Increasing factor $\eta > 1$ for fusion penalties $\lambda_2$\\ 
    $\quad$ Fusion threshold $\epsilon_{fuse} \ge 0$
  
  \STATE \textbf{Outputs:} For $\lambda_1$ fixed and $\lambda_2$ from $0$ to ${\lambda_2}_{\operatorname{initial}} \times \eta^{(I)}$ with $I$ the number of iterations: \\
    $\quad$ Regression vectors $\boldsymbol{\beta}(\lambda_1, \lambda_2) \in \mathbb R^{p \times (p-1)}$, \\
    $\quad$ Clusters of variables indices $C(\lambda_1, \lambda_2) =\{ C_k \}_{(\lambda_1, \lambda_2)}, k=1, \dots, K$
    
  \STATE \textbf{Initializations:} \\
    $\quad$ $\boldsymbol{\beta}^i = (\mathbf{X}^i)^{\dagger}\mathbf{X}^i$, $\forall i = 1, \dots, p$ for warm start in CONESTA solver \\
    $\quad$ $C = \left \{\{1\}, \dots, \{p\}\right \}$ Initial clusters with one element per cluster. \\
    $\quad$ Set $\lambda_2 = 0$ \\
    $\quad$ Compute $\boldsymbol{\beta}$ using CONESTA solver (Equivalent to Meinshausen-Bühlmann approach) \\
    $\quad$ Update clusters $C$ with rule described in \textbf{while} loop.
  
  \STATE \textbf{Set:} $\lambda_2 = {\lambda_2}_{\operatorname{initial}}$ \\
  
  \COMMENT{Clustering path}
  \WHILE{$\operatorname{Card}(C) > 1$}
    \STATE Compute $\boldsymbol{\beta}$ using CONESTA solver with warm start from previous iteration \\
    \COMMENT{Clusters update}
    \STATE Compute pairwises distances $d(i,j)=\left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2$, $\forall i,j \in \{1, \dots, p\}$ \\
    \STATE Determine clusters $C_k (k=1, \dots, K)$ with the rule $(i,j) \in C_k$ iff. $d(i,j) \le \epsilon_{fuse}$
  
    \STATE $\lambda_2 = \lambda_2 \times \kappa$
  \ENDWHILE
\end{algorithmic}
\end{algorithm}

3.3 Model selection

A crucial question for practical applications is the definition of a rule to select the penalty parameters (\lambda_1, \lambda_2). This selection problem operates at two levels: \lambda_1 controls the sparsity of the graphical model, and \lambda_2 controls the number of clusters in the optimal clustering partition. These two parameters are dealt with separately: the sparsity parameter \lambda_1 is chosen via model selection, while the clustering parameter \lambda_2 varies across a grid of values, in order to obtain graphs with different levels of granularity. The problem of model selection in graphical models is difficult in the high dimensional case where the number of samples is small compared to the number of variables, as classical AIC and BIC criteria tend to perform poorly (Liu, Roeder, and Wasserman 2010). Alternative criteria have been proposed in the literature, such as cross-validation (Bien and Tibshirani 2011), GGMSelect (Giraud, Huet, and Verzelen 2012), stability selection (Meinshausen and Bühlmann 2010; Liu, Roeder, and Wasserman 2010), Extended Bayesian Information Criterion (EBIC) (Foygel and Drton 2010), and Rotation Information Criterion (Zhao et al. 2012).

In this paper, we focused on the StARS stability selection approach proposed by Liu, Roeder, and Wasserman (2010). The method uses k subsamples of data to estimate the associated graphs for a given range of \lambda_1 values. For each value, a global instability of the graph edges is computed. The optimal value of \lambda_1 is chosen so as to minimize the instability, as follows. Let \lambda^{(1)}_1, \dots, \lambda_1^{(K)} be a grid of sparsity regularization parameters, and S_1, \dots, S_N be N bootstrap samples obtained by sampling the rows of the data set \mathbf{X}. For each k\in\{1,\ldots,K\} and for each j\in\{1,\ldots, N\}, we denote by \mathcal{A}^{k,j}(\mathbf{X}) the adjacency matrix of the estimated graph obtained by applying the inference algorithm to S_n with regularization parameter \lambda_1^{(k)}. For each possible edge (s,t)\in\{1,\ldots,p\}^2, the probability of edge appearance is estimated empirically by

\hat \theta_{st}^{(k)} = \frac{1}{N} \sum_{j=1}^N \mathcal{A}^{k,j}_{st}.

Define

\hat \xi_{st}(\Lambda) = 2 \hat \theta_{st} (\Lambda) \left ( 1 - \hat \theta_{st} (\Lambda) \right )

the empirical instability of edge (s,t) (that is, twice the variance of the Bernoulli indicator of edge (s,t)). The instability level associated to \lambda_1^{(k)} is given by

\hat D(\lambda_1^{(k)}) = \frac{\sum_{s<t} \hat \xi_{st}(\lambda_1^{(k)})}{p \choose 2}, StARS selects the optimal penalty parameter as follows

\hat \lambda = \max_k\left\{ \lambda_1^{(k)}: \hat D(\lambda_1^{(k)}) \le \beta, k\in\{1,\ldots,K\} \right \},

where \beta is the threshold chosen for the instability level.

4 Simulation experiments

In this section, we conduct a simulation study to evaluate the performance of the MGLasso method, both in terms of clustering and support recovery. Receiver Operating Characteristic (ROC) curves are used to evaluate the adequacy of the inferred graphs with the reality ground truth for the MGLasso and GLasso methods in its neighborhood selection version in the Erdös-Renyi, Scale-free, and Stochastic Block Models frameworks. The Adjusted Rand indices are used to compare the partitions obtained with MGLasso, hierarchical agglomerative clustering, and K-means clustering in a stochastic block model framework.

4.1 Synthetic data models

We consider three different synthetic network models: the Stochastic Block Model (SBM, (Fienberg and Wasserman 1981)), the Erdös-Renyi model (Erdős, Rényi, et al. 1960) and the Scale-Free model (Newman, Strogatz, and Watts 2001). In each case, Gaussian data is generated by drawing n independent realizations of a multivariate Gaussian distribution \mathcal N(0, \mathbf{\Sigma}) where \mathbf{\Sigma} \in \mathbb{R}^{p \times p} and \mathbf{\Omega} = \mathbf{\Sigma} ^{-1}. The support of \mathbf{\Omega}, equivalent to the network adjacency matrix, is generated from the three different models. The difficulty level of the problem is controlled by varying the ratio \frac{n}{p} with p fixed at 40: \frac{n}{p}\in \{0.5,1,2\}.

4.1.1 Stochastic Block-Model

We construct a block-diagonal precision matrix \mathbf{\Omega} as follows. First, we generate the support of \mathbf{\Omega} as shown in Figure 1, denoted by \boldsymbol A\in\{0,1\}^{p\times p}. To do this, the variables are first partitioned into K = 5 hidden groups, noted C_1, \dots, C_K described by a latent random variable Z_i, such that Z_i = k if i = C_k. Z_i follows a multinomial distribution

P(Z_i = k) = \pi_k, \quad \forall k \in \{1, \dots, K\}, where \pi = (\pi_1, \dots, \pi_k) is the vector of proportions of clusters whose sum is equal to one. The set of latent variables is noted \mathbf{Z} = \{ Z_1, \dots, Z_K\}. Conditionally to \mathbf{Z}, A_{ij} follows a Bernoulli distribution such that

A_{ij}|Z_i = k, Z_j = l \sim \mathcal{B}(\alpha_{kl}), \quad \forall k,l \in \{1 \dots, K\},

where \alpha_{kl} is the probability of inter-cluster connectivity, with \alpha_{kl} = 0.01 if k\neq l and \alpha_{ll} = 0,75. For k\in\{1,\ldots, K\}, we define p_k = \sum_{i=1}^p \boldsymbol{1}_{\{Z_i = k\}}. The precision matrix \mathbf{\Omega} of the graph is then calculated as follows. We define \Omega_{ij} = 0 if Z_i\neq Z_j ; otherwise, we define \Omega_{ij} = A_{ij}\omega_{ij} where, for all i\in\{1,\ldots,p\} and for all j\in\{1,\ldots,p| Z_j = Z_i\}, \omega_{ij} is given by :

If \alpha_{ll} were to be equal to one, this construction of \mathbf{\Omega} would make it possible to control the level of correlation between the variables in each block to \rho. Introducing a more realistic scheme with \alpha_{ll}=0.75 allows only to have an approximate control.

Show the code
bloc_diag <- function(n_vars, connectivity_mat, prop_clusters, rho) {
  
  true_nclusters <- 0
  
  while (true_nclusters != length(prop_clusters)){ ## To make sure we have the required number of clusters
    network <- simone::rNetwork(n_vars, connectivity_mat, prop_clusters)
    true_nclusters <- length(unique(network$clusters))
  }
  
  precision_mat <- network$A[order(network$clusters), order(network$clusters)]
  
  eff_clusters <- table(network$clusters)
  
  b = -rho/(1+rho*(eff_clusters-2) -rho^2*(eff_clusters-1))
  d = (1+rho*(eff_clusters-2))/(1+rho*(eff_clusters-2) -rho^2*(eff_clusters-1))
  
  
  for (i in 1:length(eff_clusters)) {
    temp <- precision_mat[which(row.names(precision_mat) == i), which(row.names(precision_mat) == i)]
    temp <- as.matrix(temp)
    temp[temp != 0] = b[i]
    diag(temp) = d[i]
    precision_mat[which(row.names(precision_mat) == i), which(row.names(precision_mat) == i)] = temp
  }
  
  flag <- min(svd(precision_mat)$d)
  if(flag < 0){
    diag(precision_mat) <- 1 - flag
  }
  
  return(precision_mat)
}

#' simulate data with given graph structure
sim_data <- function(p = 20,
                     np_ratio = 2,
                     structure = c("block_diagonal", "hub", "scale_free", "erdos"),
                     alpha,
                     prob_mat,
                     rho,
                     g,
                     inter_cluster_edge_prob = 0.01,
                     p_erdos = 0.1,
                     verbose = FALSE){
  
  structure <- match.arg(structure)
  n = round(np_ratio * p)
  
  switch (structure,
          erdos = {
            network <- simone::rNetwork(p, p_erdos, 1)
            correlation <- solve(network$Theta)
            X           <- mvtnorm::rmvnorm(n, sigma = correlation)
            graph <- network$Theta
            
            if(verbose) message("Data, precision matrix and graph generated from Erdos structure.")
          },
          
          hub = {
            L = huge::huge.generator(graph = "hub", g = 5, d = p, n = n, verbose = FALSE)
            
            X=L$data
            graph = L$omega
            correlation = L$sigma
            
            if(verbose) message("Data, precision matrix and graph generated from Hub structure.")
          },
          
          scale_free = {
            L = huge::huge.generator(graph = "scale-free", d = p, n = n, verbose = FALSE) ## d edges graph
            
            X=L$data
            graph = L$omega
            correlation = L$sigma
            
            if(verbose) message("Data, precision matrix and graph generated from Scale free structure.")
          },
          
          block_diagonal = { 
            
            if(inter_cluster_edge_prob != 0) { # inter-clusters edges added in the flipped way
              flag <- TRUE
              
              while(flag) {
                K <- bloc_diag(p, prob_mat, alpha, rho)
                
                target_indices <- which(K[upper.tri(K)] == 0) # Random selection of edges to be set to 1
                
                select_len <- round(length(target_indices) * inter_cluster_edge_prob)
                selected_indices <- sample(target_indices, select_len)
                
                precision_level <- unique(K[upper.tri(K)])
                precision_level <- max(precision_level[precision_level != 0]) 
                
                K[upper.tri(K)][selected_indices] <- precision_level
                K <- as.matrix(Matrix::forceSymmetric(K, uplo = "U"))
                
                flag <- any(eigen(K)$values <= 0) # Control of positive definiteness
              }
              
              correlation <- solve(K)
              graph = K
              X           <- mvtnorm::rmvnorm(n, sigma = correlation)
              
              if(verbose) message("Data, precision matrix and graph generated from block-diagonal 
                                  structure with inter-clusters edges.")
            }
            else { # Only intra-cluster edges while approximately controlling correlation level
              K <- bloc_diag(p, prob_mat, alpha, rho)
              correlation <- solve(K)
              graph = K
              X           <- mvtnorm::rmvnorm(n, sigma = correlation)
              
              if(verbose) message("Data, precision matrix and graph generated from block-diagonal 
                                  structure with only intra-clusters edges.")
            }
          }
  )
  return(list(X=X, graph=graph, correlation=correlation))
}

adj_mat <- function(mat) {
  diag(mat) <- 0
  mat[ abs(mat) < 1e-10] <- 0
  mat[mat != 0] <- 1
  return(mat)
}

set.seed(2020)
sim_sbm <- sim_data(p = 40, structure = "block_diagonal", 
                    alpha = rep(1/5, 5), 
                    prob_mat = diag(0.75, 5), 
                    rho = 0.2, inter_cluster_edge_prob = 0.01)
gsbm <- adj_mat(sim_sbm$graph)
Matrix::image(as(gsbm, "sparseMatrix"),
      sub = "", xlab = "", ylab = "")   

Figure 1: Adjacency matrix of a stochastic block model with 5 blocks.

4.1.2 Erdös-Renyi Model

The Erdös-Renyi model is a special case of the stochastic block model where \alpha_{kl} = \alpha_{ll} = \alpha is constant. We set the density \alpha of the graph to 0.1; see Figure 2 for an example of the graph resulting from this model.

Show the code
set.seed(2022)
sim_erdos <- sim_data(p = 40, structure = "erdos", p_erdos = 0.1)
gerdos <- adj_mat(sim_erdos$graph)
Matrix::image(as(gerdos, "sparseMatrix"),
      sub = "", xlab = "", ylab = "")

Figure 2: Adjacency matrix of an Erdös-Renyi model.

4.1.3 Scale-free Model

The Scale-free Model generates networks whose degree distributions follow a power law. The graph starts with an initial chain graph of 2 nodes. Then, new nodes are added to the graph one by one. Each new node is connected to an existing node with a probability proportional to the degree of the existing node. We set the number of edges in the graph to 40. An example of scale-free graph is shown in Figure 3.

Show the code
set.seed(2022)
sim_sfree <- sim_data(p = 40, structure = "scale_free")

gsfree <- adj_mat(sim_sfree$graph)

Matrix::image(as(gsfree, "sparseMatrix"),
      sub = "", xlab = "", ylab = "")

Figure 3: Adjacency matrix of a Scale-free model.

4.2 Support recovery

We compare the network structure learning performance of our approach to that of GLasso in its neighborhood selection version using ROC curves. In both GLasso and MGLasso, the sparsity is controlled by a regularization parameter \lambda_1; however, MGLasso admits an additional regularization parameter, \lambda_2, which controls the strength of convex clustering. To compare the two methods, in each ROC curve, we vary the parameter \lambda_1 while the parameter \lambda_2 (for MGLasso) is kept constant. We computed ROC curves for 4 different penalty levels for the \lambda_2 parameter; since GLasso does not depend on \lambda_2, the GLasso ROC curves are replicated.

In a decision rule associated with a sparsity penalty level \lambda_1, we recall the definition of the two following functions. The sensitivity, also called the true positive rate or recall, is given by : \begin{align*} \lambda_1 &\mapsto \text{sensitivity}(\lambda_1) = \frac{TP(\lambda_1)}{TP(\lambda_1) + FN(\lambda_1)}. \end{align*} Specificity, also called true negative rate or selectivity, is defined as follows: \begin{align*} \lambda_1 &\mapsto \text{specificity}(\lambda_1) = \frac{TN(\lambda_1)}{TN(\lambda_1) + FP(\lambda_1)}. \end{align*} The ROC curve with the parameter \lambda_1 represents \text{sensitivity}(\lambda_1) as a function of 1 - \text{specificity}(\lambda_1) which is the false positive rate.

For each configuration (n, p fixed), we generate 50 replications and their associated ROC curves, which are then averaged. The average ROC curves for the three models are given in Figure 4, Figure 5 and Figure 6 by varying \frac{n}{p}\in \{0.5,1,2\}.

Show the code
path_checks <- "./rscripts/mglasso_functions/"
source(paste0(path_checks, "simulate.R"))
source(paste0(path_checks, "plot.R"))
Show the code
# The code used to generate the following results is based on R and python libraries/functions.
# the reticulate package allows to compile python code in Rstudio
# First set up the python dependencies before running the code

library(reticulate)
config <- reticulate::py_config() # Initialize python engine 
# It is possible to create an isolated virtual or conda environment in which the code will be compile
# by calling reticulate::virtualenv_create() or reticulate::conda_create() and then activate the environment with reticulate::use_condaenv() or reticulate::use_virtualenv(). 
system2(config$python, c("-m", "pip", "install",
                          shQuote("git+https://github.com/neurospin/pylearn-parsimony.git")))
# The pylean-parsimony require scipy version 1.7.1. More recent versions generate compilation errors.
reticulate::py_install(packages = c("scipy == 1.7.1",
                                     "scikit-learn",
                                      "numpy == 1.22.4",
                                      "six", # If needed
                                      "matplotlib"))

# To check if all required python dependencies are available.
testthat::expect_true(reticulate::py_module_available("numpy"))
testthat::expect_true(reticulate::py_module_available("scipy"))
testthat::expect_true(reticulate::py_module_available("six"))
testthat::expect_true(reticulate::py_module_available("matplotlib"))
testthat::expect_true("scikit-learn" %in% reticulate::py_list_packages()$package)
testthat::expect_true("pylearn-parsimony" %in% reticulate::py_list_packages()$package)

# It might be necessary to reload Rstudio on some operator systems.
source("./rscripts/mglasso_functions/onload.R")
Show the code
# Performances calculation ------------------------------------------------
# Launched on a cluster using 72 cores

## Settings ----------------------------------------------------------------
### Model -------------------------------------------------------------------
# NA values for some parameter mean they are not relevant
p         <- 40
seq_n     <- c(20, 40, 80)
seq_rho   <- 0.95
seq_dnsty <- NA 
type      <- NA
alpha     <- rep(1/5, 5)
ngroup    <- length(alpha)
pi        <- diag(0.75, ngroup)

### Simulation --------------------------------------------------------------
n_simu      <- 50
list_ii_rho <- configs_simu(n_simu, seq_rho, seq_dnsty, seq_n, type)
no_cores    <- min(72, length(list_ii_rho))

### Erdos -------------------------------------------------------------------
runtime_roc_config_p40_erdos01 <- system.time(
  roc_config_p40_erdos01 <- mclapply(
    list_ii_rho, 
    FUN = one_simu_ROC, 
    model = "erdos",
    mc.cores = no_cores)
)

save(roc_config_p40_erdos01, 
     file = paste0(path_roc, "roc_config_p40_erdos01.RData"))
save(runtime_roc_config_p40_erdos01, 
     file = paste0(path_roc, "runtime_roc_config_p40_erdos01.RData"))

## Erdos
load(paste0(path_roc, "roc_config_p40_erdos01.RData")) 
dt_full <- roc_config_p40_erdos01

### Merge in one graph
# Three sample sizes are used and the vector c(20,40,80) is replicated 50 times
# I subset the dataframe in three parts corresponding to the relevant sample sizes
index <- seq(1, 150, by = 3)
roc_dt20 <- dt_full[index]

index <- seq(2, 150, by = 3)
roc_dt40 <- dt_full[index]

index <- seq(3, 150, by = 3)
roc_dt80 <- dt_full[index]

# Here we compute the mean over the 50 ROC curves
roc_dt20 <- get_mean_ROC_stat(roc_dt20)
roc_dt40 <- get_mean_ROC_stat(roc_dt40)
roc_dt80 <- get_mean_ROC_stat(roc_dt80)

# I restructure the list result in a matrix for plot
roc_dt20 <- reformat_roc_res_for_ggplot(roc_dt20)
roc_dt20$np <- 0.5 # add a ratio n over p variable
roc_dt40 <- reformat_roc_res_for_ggplot(roc_dt40)
roc_dt40$np <- 1
roc_dt80 <- reformat_roc_res_for_ggplot(roc_dt80)
roc_dt80$np <- 2

roc_dtf_erdos <- rbind(roc_dt20, roc_dt40, roc_dt80)
Show the code
load("./data/roc_dtf_erdos.RData")

ggplot(roc_dtf_erdos, aes(x     = fpr, 
                          y     = tpr, 
                          color = method )) + 
  geom_line(size = 0.7) + 
  facet_grid(np ~ tv) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey") +
  xlab("False Positive Rate") +
  ylab("True Positive Rate") +
  ggtitle("") +
  scale_colour_manual(values = ghibli_palette("MarnieMedium1")[5:6])

Figure 4: ROC curves for the Erdös-Renyi model comparing MGLasso and GLasso methods.

Show the code
# Performances calculation ------------------------------------------------
# Launched on a cluster using 72 cores

## Settings ----------------------------------------------------------------
### Model -------------------------------------------------------------------
# NA values for some parameter mean they are not relevant
p         <- 40
seq_n     <- c(20, 40, 80)
seq_rho   <- 0.95
seq_dnsty <- NA 
type      <- NA
alpha     <- rep(1/5, 5)
ngroup    <- length(alpha)
pi        <- diag(0.75, ngroup)

### Simulation --------------------------------------------------------------
n_simu      <- 50
list_ii_rho <- configs_simu(n_simu, seq_rho, seq_dnsty, seq_n, type)
no_cores    <- min(72, length(list_ii_rho))


### Scale-Free --------------------------------------------------------------
runtime_roc_config_p40_scalefree <- system.time(
  roc_config_p40_scalefree <- mclapply(
    list_ii_rho, 
    FUN = one_simu_ROC, 
    model = "scale_free",
    mc.cores = no_cores)
)

save(roc_config_p40_scalefree, 
     file = paste0(path_roc, "roc_config_p40_scalefree.RData"))
save(runtime_roc_config_p40_scalefree, 
     file = paste0(path_roc, "runtime_roc_config_p40_scalefree.RData"))

## Scale-free
load(paste0(path_roc, "roc_config_p40_scalefree.RData")) 
dt_full <- roc_config_p40_scalefree

### Merge in one graph
index <- seq(1, 150, by = 3)
roc_dt20 <- dt_full[index]

index <- seq(2, 150, by = 3)
roc_dt40 <- dt_full[index]

index <- seq(3, 150, by = 3)
roc_dt80 <- dt_full[index]

roc_dt20 <- get_mean_ROC_stat(roc_dt20)
roc_dt40 <- get_mean_ROC_stat(roc_dt40)
roc_dt80 <- get_mean_ROC_stat(roc_dt80)

roc_dt20 <- reformat_roc_res_for_ggplot(roc_dt20)
roc_dt20$np <- 0.5
roc_dt40 <- reformat_roc_res_for_ggplot(roc_dt40)
roc_dt40$np <- 1
roc_dt80 <- reformat_roc_res_for_ggplot(roc_dt80)
roc_dt80$np <- 2

roc_dtf_sfree <- rbind(roc_dt20, roc_dt40, roc_dt80)

### Save
save(roc_dtf_sfree,
     file = paste0(path_roc, "roc_dtf_sfree.RData"))
Show the code
load("./data/roc_dtf_sfree.RData")

ggplot(roc_dtf_sfree, aes(x     = fpr, 
                          y     = tpr, 
                          color = method )) + 
  geom_line() + 
  facet_grid(np ~ tv) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey") +
  xlab("False Positive Rate") +
  ylab("True Positive Rate") +
  ggtitle("") + 
  scale_colour_manual(values = ghibli_palette("MarnieMedium1")[5:6])

Figure 5: ROC curves for the Scale-free model comparing MGLasso and GLasso methods.

Show the code
# Launched on a cluster using 72 cores
## Settings ----------------------------------------------------------------
### Model -------------------------------------------------------------------
# NA values for some parameter mean they are not relevant
p         <- 40
seq_n     <- c(20, 40, 80)
seq_rho   <- 0.95
seq_dnsty <- NA 
type      <- NA
alpha     <- rep(1/5, 5)
ngroup    <- length(alpha)
pi        <- diag(0.75, ngroup)

### Simulation --------------------------------------------------------------
n_simu      <- 50
list_ii_rho <- configs_simu(n_simu, seq_rho, seq_dnsty, seq_n, type)
no_cores    <- min(72, length(list_ii_rho))

### Stochastic Block Diagonal -----------------------------------------------
runtime_roc_config_p40_bdiagflip001 <- system.time(
  roc_config_p40_bdiagflip001 <- mclapply(
    list_ii_rho, 
    FUN = one_simu_ROC, 
    model = "block_diagonal",
    mc.cores = no_cores)
)
save(roc_config_p40_bdiagflip001, 
     file = paste0(path_roc, "roc_config_p40_bdiagflip001.RData"))
save(runtime_roc_config_p40_bdiagflip001, 
     file = paste0(path_roc, "runtime_roc_config_p40_bdiagflip001.RData"))

load(paste0(path_roc, "roc_config_p40_bdiagflip001.RData")) 
dt_full <- roc_config_p40_bdiagflip001

### Merge in one graph
index <- seq(1, 150, by = 3)
roc_dt20 <- dt_full[index]

index <- seq(2, 150, by = 3)
roc_dt40 <- dt_full[index]

index <- seq(3, 150, by = 3)
roc_dt80 <- dt_full[index]

roc_dt20 <- get_mean_ROC_stat(roc_dt20)
roc_dt40 <- get_mean_ROC_stat(roc_dt40)
roc_dt80 <- get_mean_ROC_stat(roc_dt80)

roc_dt20 <- reformat_roc_res_for_ggplot(roc_dt20)
roc_dt20$np <- 0.5
roc_dt40 <- reformat_roc_res_for_ggplot(roc_dt40)
roc_dt40$np <- 1
roc_dt80 <- reformat_roc_res_for_ggplot(roc_dt80)
roc_dt80$np <- 2

roc_dtf_sbm <- rbind(roc_dt20, roc_dt40, roc_dt80)

save(roc_dtf_sbm,
     file = paste0(path_roc, "roc_dtf_sbm.RData"))
Show the code
load("./data/roc_dtf_sbm.RData")

ggplot(roc_dtf_sbm, aes(x     = fpr, 
                          y     = tpr, 
                          color = method )) + 
  geom_line() + 
  facet_grid(np ~ tv) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey") +
  xlab("False Positive Rate") +
  ylab("True Positive Rate") +
  ggtitle("") +
  scale_colour_manual(values = ghibli_palette("MarnieMedium1")[5:6])

Figure 6: ROC curves for the Stochastic Block model comparing MGLasso and GLasso methods.

Based on these empirical results, we first observe that, in all the considered simulation models, MGLasso improves over GLasso in terms of support recovery in the high-dimensional setting where p<n. In addition, in the absence of a total variation penalty, i.e., \lambda_2 = 0, MGLasso performs no worse than GLasso in each of the 3 models. However, for \lambda_2>0, increasing penalty value does not seem to significantly improve the support recovery performances for the MGLasso, as we observe similar results for \lambda_2=3.3,6.6,10. Preliminary analyses show that, as \lambda_2 increases, the estimates of the regression vectors are shrunk towards 0. This shrinkage effect of group-fused penalty terms was also observed in (Chu et al. 2021).

4.3 Clustering

In order to obtain clustering performance, we compared the partitions estimated by MGLasso, Hierarchical Agglomerative Clustering (HAC) with Ward’s distance and K-means to the true partition in a stochastic block model framework. Euclidean distances between variables are used for HAC and K-means. The criterion used for the comparison is the adjusted Rand index. We studied the influence of the correlation level inside clusters on the clustering performances through two different parameters: \rho \in \{ 0.1, 0.3 \}; the vector of cluster proportions is fixed at \mathbf \pi = (1/5, \dots, 1/5). We then simulate 100 Gaussian data sets for each simulation configuration (\rho, n/p fixed).The optimal sparsity penalty for MGLasso is chosen by the Stability Approach to Regularization Selection (StARS) method (Liu, Roeder, and Wasserman 2010), and we vary the parameter \lambda_2.

Show the code
# Launch simulations ------------------------------------------------------
## Settings ----------------------------------------------------------------
### Model -------------------------------------------------------------------
p         <- 40
seq_n     <- c(20, 40, 80) 
alpha     <- rep(1/5, 5)
seq_rho   <- c(0.25, 0.95)
seq_dnsty <- c(0.75)
type      <- NA     #1 ## unused to do: delete in configs_simu parameters
ngroup    <- length(alpha)
pi        <- diag(0.75, ngroup)

### Simulation --------------------------------------------------------------
n_simu      <- 100
list_ii_rho <- configs_simu(n_simu, seq_rho, seq_dnsty, seq_n, type)
mc_cores    <- min(80, length(list_ii_rho))
RNGkind("L'Ecuyer-CMRG")

## Test --------------------------------------------------------------------
# For a quicker test: 
#   #set nl2 to 2 in one_simu_extended
#   set p = 9 & n = 10
test <- one_simu_extended(list_ii_rho$`1`, verbose = TRUE, model = "block_diagonal")

## Models ------------------------------------------------------------------
# After the quicker test: 
#   reset nl2 to 20

### Stochastic Block Diagonal -----------------------------------------------
runtime_rand50_config_p40_bdiagflip001_allcor <- system.time(
  rand50_config_p40_bdiagflip001_allcor <- mclapply(
    list_ii_rho, 
    FUN = one_simu_extended, 
    model = "block_diagonal",
    mc.cores = mc_cores)
)

save(runtime_rand50_config_p40_bdiagflip001_allcor, 
     file = paste0(path_extended, "runtime_rand100_config_p40_bdiagflip001_allcor.RData"))
save(rand50_config_p40_bdiagflip001_allcor, 
     file = paste0(path_extended, "rand100_config_p40_bdiagflip001_allcor.RData"))
Show the code
# Clustering  
# Settings:  
# - Inter-clusters edge probability $0.01$ (flip on all the missing edges)  

## Rand index  
load(paste0(path_extended, "rand100_config_p40_bdiagflip001_allcor.RData")) 
dt <- rand100_config_p40_bdiagflip001_allcor

# Calculate clusters partitions with thresh_fuse as the required difference threshold for merging two regression vectors
list_res <- lapply(dt, function(e){get_perf_from_raw("rand", e, thresh_fuse = 1e-6)})
dt_rand <- do.call(rbind, list_res)

save(dt_rand,
     file = paste0(path_extended, "dt_rand_p40_bdiagflip001_allcor_thresh_e6.RData"))

## Theoritical correlation set to $0.25$

plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20), cor_ = 0.25, np_ = c(0.5, 1, 2))

## Theoritical correlation set to $0.95$
plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20), cor_ = 0.95, np_ = c(0.5, 1, 2))

# The files `rand_dt_higher_cor_sbm.RData` and `rand_dt_lower_cor_sbm.RData` are obtained from splitting `dt_rand`according to theoritical correlation levels.
Show the code
load("./data/rand_dt_lower_cor_sbm.RData")
plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20), cor_ = 0.25, np_ = c(0.5, 1, 2), 
         main = "")

Figure 7: Adjusted Rand Indices for the HAC, k-means and MGLasso methods. Performances are observed for 4 different number of clusters in a low correlation context.

Show the code
load("./data/rand_dt_higher_cor_sbm.RData")
plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20), cor_ = 0.95, np_ = c(0.5, 1, 2),
         main = "")

Figure 8: Adjusted Rand Indices for the HAC, k-means and MGLasso methods. Performances are observed for 4 different number of clusters in a high correlation context.

The results shown in Figure 7 and Figure 8 demonstrate that, particularly at the lower to medium levels of the hierarchy (between 20 and 10 clusters), the hierarchical clustering structure uncovered by MGLasso is comparable to popular clustering methods used in practice. For the higher levels (5 clusters), the performances of MGLasso deteriorate. As expected, the three compared methods also deteriorate as the level of correlation inside clusters decreases.

5 Analysis of microbial associations data

We finally illustrate our new method of inferring the multiscale Gaussian graphical model, with an application to the analysis of microbial associations in the American Gut Project. The data used are count data that have been previously normalized by applying the log-centered ratio technique as used in (Kurtz 2015). After some filtering steps (Kurtz 2015) on the operational taxonomic units OTUs counts (removed if present in less than 37\% of the samples) and the samples (removed if sequencing depth below 2700), the top OTUs are grouped in a dataset composed of n_1 = 289 for 127 OTUs. As a preliminary analysis, we perform a hierarchical agglomerative clustering (HAC) on the OTUs, which allows us to identify four significant groups. The correlation matrix of the dataset is given in Figure 9; variables have been rearranged according to the HAC partition.

Show the code
data(amgut1.filt, package = "SpiecEasi")
dta <- amgut1.filt
dta <- t(SpiecEasi::clr(dta + 1 , 1))

S <- cor(dta)
hclust_dta <- hclust(dist(t(dta)), method = "ward.D")
hclust_dta <- hclust(as.dist(1-S^2), method = "ward.D")

cut_dta <- stats::cutree(hclust_dta, 4)

Matrix::image(as(S[order(cut_dta), order(cut_dta)],
         "sparseMatrix"), 
      sub = "", xlab = "", ylab = "",
      colorkey = FALSE)

Figure 9: Empirical correlations in the gut data.

The average correlations within blocks of variables belonging to the same cluster are given below. We observe relatively high levels of correlation in small blocks, similar to the simulation models used to evaluate the performance of clustering in the Simulation Experiments section.

Show the code
C <- cor(dta)
diag(C) <- 0
clusters <- cut_dta

seq_p <- 1:length(clusters)
L <- split(seq_p, factor(clusters))

mat <- t(sapply(L,
                FUN = function(v) {
                  summary(as.vector(C[v,v]))
                }))

out <- data.frame(Cluster = 1:4, "Mean correlation" = round(mat[, "Mean"], 3))
knitr::kable(out)
Cluster Mean.correlation
1 0.013
2 0.815
3 0.555
4 0.566

We apply MGLasso to the normalised counts to infer a graph and a clustering structure. The graphs obtained by MGLasso for \lambda_2 = 0 and for \lambda_2 = 5 (corresponding respectively 127 and 80 clusters) are given below. In each case, the parameter \lambda_1 is chosen by stability selection (see Model Selection section).

Show the code
knitr::include_graphics(c("figures/mglasso_tv0.png","figures/mglasso_full_graph_tv5.png"))

Figure 10: Infered graphs using MGLasso for different values of total variation penalty.

The variables were reordered according to the clustering partition obtained from the distances between the regression vectors. Increasing \lambda_2 reduces the number of clusters and leads to a shrinking effect on the estimates. The adjacency matrix of the neighborhood selection equivalent to setting \lambda_2 to 0 is given in Figure 10 (up). In Figure 10 (down), the deduced partition is composed of 80 clusters. A confusion matrix comparing the edges deduced by MGLasso with \lambda_2 = 5 and neighborhood selection is given below. Adding a total variation parameter increases the merging effect, resulting in a larger number of edges in the graph.

Neighborhood selection
MGLasso (\lambda_2 = 5) non-edges edges
non-edges 15678 0
edges 288 163

6 Conclusion

We proposed a new technique that combines Gaussian Graphical Model inference and hierarchical clustering called MGLasso. The method proceeds via convex optimization and minimizes the neighborhood selection objective penalized by a hybrid regularization combining a sparsity-inducing norm and a convex clustering penalty. We developed a complete numerical scheme to apply MGLasso in practice, with an optimization algorithm based on CONESTA and a model selection procedure. Our simulations results over synthetic and real datasets showed that MGLasso can perform better than GLasso in network support recovery in the presence of groups of correlated variables, and we illustrated the method with the analysis of microbial associations data. The present work paves the way for future improvements: first, by incorporating prior knowledge through more flexible weighted regularization; second, by studying the theoretical properties of the method in terms of statistical guarantees for the MGLasso estimator. Moreover the node-wise regression approach on which our method is based can be extended to a broader family of non-Gaussian distributions belonging to the exponential family as outlined by Yang et al. (2012).

Session information

Show the code
sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=French_France.utf8  LC_CTYPE=French_France.utf8   
[3] LC_MONETARY=French_France.utf8 LC_NUMERIC=C                  
[5] LC_TIME=French_France.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ghibli_0.3.2      Matrix_1.4-1      huge_1.3.5        SpiecEasi_1.1.2  
[5] simone_1.0-4      blockmodels_1.1.5 ggplot2_3.3.6    

loaded via a namespace (and not attached):
 [1] shape_1.4.6       tidyselect_1.1.2  xfun_0.31         purrr_0.3.4      
 [5] splines_4.2.1     lattice_0.20-45   colorspace_2.0-3  vctrs_0.4.1      
 [9] generics_0.1.3    htmltools_0.5.3   stats4_4.2.1      yaml_2.3.5       
[13] utf8_1.2.2        survival_3.3-1    rlang_1.0.4       pillar_1.8.1     
[17] glue_1.6.2        withr_2.5.0       DBI_1.1.3         foreach_1.5.2    
[21] lifecycle_1.0.1   stringr_1.4.1     munsell_0.5.0     gtable_0.3.0     
[25] mvtnorm_1.1-3     htmlwidgets_1.5.4 codetools_0.2-18  VGAM_1.1-7       
[29] evaluate_0.16     labeling_0.4.2    knitr_1.40        fastmap_1.1.0    
[33] parallel_4.2.1    fansi_1.0.3       highr_0.9         Rcpp_1.0.9       
[37] scales_1.2.0      jsonlite_1.8.0    farver_2.1.1      pulsar_0.3.7     
[41] digest_0.6.29     stringi_1.7.8     dplyr_1.0.9       grid_4.2.1       
[45] cli_3.3.0         tools_4.2.1       magrittr_2.0.3    glmnet_4.1-4     
[49] tibble_3.1.8      pkgconfig_2.0.3   MASS_7.3-57       assertthat_0.2.1 
[53] rmarkdown_2.16.1  rstudioapi_0.13   iterators_1.0.14  R6_2.5.1         
[57] prismatic_1.1.0   igraph_1.3.4      compiler_4.2.1   

References

Ambroise, Christophe, Julien Chiquet, and Catherine Matias. 2009. Inferring sparse gaussian graphical models with latent structure.” Electronic Journal of Statistics 3 (0): 205–38. https://doi.org/10.1214/08-EJS314.
Banerjee, Onureena, Laurent El Ghaoui, and Alexandre d’Aspremont. 2008. “Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data” 9 (June): 485–516.
Beck, Amir, and Marc Teboulle. 2009. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM J. Imaging Sciences 2 (January): 183–202. https://doi.org/10.1137/080716542.
Bien, Jacob, and Robert J Tibshirani. 2011. “Sparse Estimation of a Covariance Matrix.” Biometrika 98 (4): 807–20.
Bühlmann, Peter, Philipp Rütimann, Sara Van De Geer, and Cun-Hui Zhang. 2012. Correlated variables in regression: clustering and sparse estimation.”
Cai, Tony, Weidong Liu, and Xi Luo. 2011. “A Constrained L1 Minimization Approach to Sparse Precision Matrix Estimation.” Journal of the American Statistical Association 106 (494): 594–607. https://doi.org/10.1198/jasa.2011.tm10155.
Chakraborty, Saptarshi, and Jason Xu. 2020. “Biconvex Clustering.” arXiv Preprint arXiv:2008.01760.
Chen, Xi, Seyoung Kim, Qihang Lin, Jaime G Carbonell, and Eric P Xing. 2010. “Graph-Structured Multi-Task Regression and an Efficient Optimization Method for General Fused Lasso.” arXiv Preprint arXiv:1005.3579.
Cheng, Lulu, Liang Shan, and Inyoung Kim. 2017. Multilevel Gaussian graphical model for multilevel networks.” Journal of Statistical Planning and Inference 190 (November): 1–14. https://doi.org/10.1016/j.jspi.2017.05.003.
Chi, Eric C, and Kenneth Lange. 2015. “Splitting Methods for Convex Clustering.” Journal of Computational and Graphical Statistics 24 (4): 994–1013.
Chiquet, Julien, Pierre Gutierrez, and Guillem Rigaill. 2017. “Fast Tree Inference with Weighted Fusion Penalties.” Journal of Computational and Graphical Statistics 26 (1): 205–16.
Chu, Shuyu, Huijing Jiang, Zhengliang Xue, and Xinwei Deng. 2021. “Adaptive Convex Clustering of Generalized Linear Models with Application in Purchase Likelihood Prediction.” Technometrics 63 (2): 171–83.
Danaher, Patrick, Pei Wang, and Daniela M Witten. 2014. “The Joint Graphical Lasso for Inverse Covariance Estimation Across Multiple Classes.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (2): 373–97.
Degras, David. 2021. “Sparse Group Fused Lasso for Model Segmentation: A Hybrid Approach.” Advances in Data Analysis and Classification 15 (3): 625–71.
Dempster, A. P. 1972. Covariance Selection.” Biometrics 28 (1): 157. https://doi.org/10.2307/2528966.
Devijver, Emilie, and Mélina Gallopin. 2018. Block-Diagonal Covariance Selection for High-Dimensional Gaussian Graphical Models.” Journal of the American Statistical Association 113 (521): 306–14. https://doi.org/10.1080/01621459.2016.1247002.
Erdős, Paul, Alfréd Rényi, et al. 1960. “On the Evolution of Random Graphs.” Publ. Math. Inst. Hung. Acad. Sci 5 (1): 17–60.
Fan, Jianqing, Yuan Liao, and Han Liu. 2016. “An Overview of the Estimation of Large Covariance and Precision Matrices.” The Econometrics Journal 19 (1): C1–32. https://doi.org/https://doi.org/10.1111/ectj.12061.
Fienberg, Stephen E, and Stanley S Wasserman. 1981. “Categorical Data Analysis of Single Sociometric Relations.” Sociological Methodology 12: 156–92.
Foygel, Rina, and Mathias Drton. 2010. “Extended Bayesian Information Criteria for Gaussian Graphical Models.” arXiv Preprint arXiv:1011.6640.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2007. Sparse inverse covariance estimation with the graphical lasso.”
Ganguly, Apratim, and Wolfgang Polonik. 2014. “Local Neighborhood Fusion in Locally Constant Gaussian Graphical Models.” https://arxiv.org/abs/1410.8766.
Giraud, Christophe, Sylvie Huet, and Nicolas Verzelen. 2012. “Graph Selection with GGMselect.” Statistical Applications in Genetics and Molecular Biology 11 (3).
Grechkin, Maxim, Maryam Fazel, Daniela Witten, and Su-In Lee. 2015. “Pathway Graphical Lasso.” In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2617–23. AAAI’15. Austin, Texas: AAAI Press.
Hadj-Selem, Fouad, Tommy Lofstedt, Elvis Dohmatob, Vincent Frouin, Mathieu Dubois, Vincent Guillemot, and Edouard Duchesnay. 2018. Continuation of Nesterov’s Smoothing for Regression with Structured Sparsity in High-Dimensional Neuroimaging.” IEEE Transactions on Medical Imaging 2018. https://doi.org/10.1109/TMI.2018.2829802.
Hallac, David, Jure Leskovec, and Stephen Boyd. 2015. “Network Lasso: Clustering and Optimization in Large Graphs.” In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 387–96.
Hocking, T., Jean-Philippe Vert, F. Bach, and Armand Joulin. 2011. “Clusterpath: An Algorithm for Clustering Using Convex Fusion Penalties.” In ICML.
Honorio, Jean, Dimitris Samaras, Nikos Paragios, Rita Goldstein, and Luis E Ortiz. 2009. “Sparse and Locally Constant Gaussian Graphical Models.” Advances in Neural Information Processing Systems 22: 745–53.
Hsieh, Cho-Jui, Mátyás A. Sustik, Inderjit S. Dhillon, and Pradeep Ravikumar. 2014. “QUIC: Quadratic Approximation for Sparse Inverse Covariance Estimation.” Journal of Machine Learning Research 15 (83): 2911–47. http://jmlr.org/papers/v15/hsieh14a.html.
Koller, Daphne, and Nir Friedman. 2009. Probabilistic Graphical Models: Principles.” Italica 51 (3): 327. https://doi.org/10.2307/478142.
Kurtz, Christian L. AND Miraldi, Zachary D. AND Müller. 2015. “Sparse and Compositionally Robust Inference of Microbial Ecological Networks.” PLOS Computational Biology 11 (May): 1–25. https://doi.org/10.1371/journal.pcbi.1004226.
Lauritzen, Steffen L. 1996. Graphical models. Clarendon Press. https://global.oup.com/academic/product/graphical-models-9780198522195?cc=fr&lang=en&.
Lin, Meixia, Defeng Sun, Kim-Chuan Toh, and Chengjing Wang. 2020. “Estimation of Sparse Gaussian Graphical Models with Hidden Clustering Structure.” arXiv Preprint arXiv:2004.08115.
Lindsten, F., H. Ohlsson, and L. Ljung. 2011. “Clustering Using Sum-of-Norms Regularization: With Application to Particle Filter Output Computation.” In 2011 IEEE Statistical Signal Processing Workshop (SSP), 201–4. https://doi.org/10.1109/SSP.2011.5967659.
Liu, Han, Kathryn Roeder, and Larry Wasserman. 2010. Stability approach to regularization selection (StARS) for high dimensional graphical models.” Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010, NIPS 2010, 1–14.
Mazumder, Rahul, and Trevor Hastie. 2012. The graphical lasso: New insights and alternatives.” Electronic Journal of Statistics 6 (none): 2125–49. https://doi.org/10.1214/12-EJS740.
Meinshausen, Nicolai, and Peter Bühlmann. 2006. High-dimensional graphs and variable selection with the Lasso.” Annals of Statistics 34 (3): 1436–62. https://doi.org/10.1214/009053606000000281.
———. 2010. “Stability Selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 (4): 417–73.
Nesterov, Yu. 2005. “Smooth Minimization of Non-Smooth Functions.” Mathematical Programming 103 (1): 127–52.
Newman, Mark EJ, Steven H Strogatz, and Duncan J Watts. 2001. “Random Graphs with Arbitrary Degree Distributions and Their Applications.” Physical Review E 64 (2): 026118.
Park, Mee Young, Trevor Hastie, and Robert Tibshirani. 2006. Averaged gene expressions for regression.” Biostatistics 8 (2): 212–27. https://doi.org/10.1093/biostatistics/kxl002.
Peng, Jie, Pei Wang, Nengfeng Zhou, and Ji Zhu. 2009. “Partial Correlation Estimation by Joint Sparse Regression Models.” Journal of the American Statistical Association 104 (486): 735–46. https://doi.org/10.1198/jasa.2009.0126.
Rocha, Guilherme V., Peng Zhao, and Bin Yu. 2008. “A Path Following Algorithm for Sparse Pseudo-Likelihood Inverse Covariance Estimation (SPLICE).”
Rothman, Adam J., Peter J. Bickel, Elizaveta Levina, and Ji Zhu. 2008. Sparse permutation invariant covariance estimation.” Electronic Journal of Statistics 2 (none): 494–515. https://doi.org/10.1214/08-EJS176.
Tan, Kean Ming, Daniela Witten, and Ali Shojaie. 2013. The Cluster Graphical Lasso for improved estimation of Gaussian graphical models,” July. http://arxiv.org/abs/1307.5339.
Tibshirani, R. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society (Series B) 58: 267–88.
Tibshirani, Robert, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. 2005. “Sparsity and Smoothness via the Fused Lasso.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (1): 91–108.
Yang, Eunho, Genevera Allen, Zhandong Liu, and Pradeep Ravikumar. 2012. “Graphical Models via Generalized Linear Models.” Advances in Neural Information Processing Systems 25.
Yao, Tianyi, and Genevera I. Allen. 2019. “Clustered Gaussian Graphical Model via Symmetric Convex Clustering.” In 2019 IEEE Data Science Workshop (DSW), 76–82. https://doi.org/10.1109/DSW.2019.8755774.
Yuan, Ming. 2010. “High Dimensional Inverse Covariance Matrix Estimation via Linear Programming.” Journal of Machine Learning Research 11 (79): 2261–86. http://jmlr.org/papers/v11/yuan10b.html.
Yuan, Ming, and Yi Lin. 2007. Model selection and estimation in the Gaussian graphical model.” Biometrika 94 (1): 19–35. https://doi.org/10.1093/biomet/asm018.
Zhao, Tuo, Han Liu, Kathryn Roeder, John Lafferty, and Larry Wasserman. 2012. “The Huge Package for High-Dimensional Undirected Graph Estimation in r.” The Journal of Machine Learning Research 13 (1): 1059–62.

Reuse

Citation

BibTeX citation:
@article{sanou,
  author = {Edmond Sanou and Christophe Ambroise and Geneviève Robin},
  title = {Inference of {Multiscale} {Gaussian} {Graphical} {Model}},
  journal = {Computo},
  date = {},
  url = {https://github.com/desanou/multiscale_glasso},
  doi = {xxxx},
  langid = {en},
  abstract = {Gaussian Graphical Models (GGMs) are widely used for
    exploratory data analysis in various fields such as genomics,
    ecology, psychometry. In a high-dimensional setting, when the number
    of variables exceeds the number of observations by several orders of
    magnitude, the estimation of GGM is a difficult and unstable
    optimization problem. Clustering of variables or variable selection
    is often performed prior to GGM estimation. We propose a new method
    allowing to simultaneously infer a hierarchical clustering structure
    and the graphs describing the structure of independence at each
    level of the hierarchy. This method is based on solving a convex
    optimization problem combining a graphical lasso penalty \textless
    edmond\textgreater{} in its neigborhood selection version
    \textless\textbackslash edmond\textgreater{} with a fused type lasso
    penalty. Results on real and synthetic data are presented.}
}
For attribution, please cite this work as:
Edmond Sanou, Christophe Ambroise, and Geneviève Robin. n.d. “Inference of Multiscale Gaussian Graphical Model.” Computo. https://doi.org/xxxx.